A Quality-Aware Optimizer for Information Extraction
Abstract
Large amounts of structured information are buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect, and their output has imperfect precision and recall (i.e., it contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as "knobs" to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this paper, we show how to use receiver operating characteristic (ROC) curves to estimate the extraction quality in a statistically robust way, and how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the run time and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach accurately predicts the output quality and selects the fastest execution plan that satisfies the output quality restrictions.
Similar resources
Felix: Scaling up Global Statistical Information Extraction Using an Operator-based Approach
To support the next generation of sophisticated information extraction (IE) applications, several researchers have proposed frameworks that integrate SQL-like languages with statistical reasoning. While these frameworks demonstrate impressive quality on small IE tasks, they currently do not scale to enterprise-sized tasks. To enable the next generation of IE, a promising approach is to improve ...
Profile Aware Retrieval Optimizer for Continuous Media
One of the key components of multimedia systems is a Continuous Media (CM) server that guarantees the uninterrupted delivery of continuous media data (i.e., audio and video). Queries imposed by applications, such as customized news-on-demand, might require the retrieval of one or more continuous objects from the CM server. Traditionally, multimedia systems have opted to guarantee that the CM se...
SystemT: An Algebraic Approach to Declarative Information Extraction
As information extraction (IE) becomes more central to enterprise applications, rule-based IE engines have become increasingly important. In this paper, we describe SystemT, a rule-based IE system whose basic design removes the expressivity and performance limitations of current systems based on cascading grammars. SystemT uses a declarative rule language, AQL, and an optimizer that generates h...
SystemT: A Declarative Information Extraction System
Emerging text-intensive enterprise applications such as social analytics and semantic search pose new challenges of scalability and usability to Information Extraction (IE) systems. This paper presents SystemT, a declarative IE system that addresses these challenges and has been deployed in a wide range of enterprise applications. SystemT facilitates the development of high quality complex anno...
When Speed Has a Price: Fast Information Extraction Using Approximate Algorithms
A wealth of information produced by individuals and organizations is expressed in natural language text. This is a problem since text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challengi...
Publication date: 2008